The impacts of active and self-supervised learning on efficient annotation of single-cell expression data

A crucial step in the analysis of single-cell data is annotating cells to cell types and states. While a myriad of approaches has been proposed, manual labeling of cells to create training datasets remains tedious and time-consuming. In the field of machine learning, active and self-supervised learning methods have been proposed to improve the performance of a classifier while reducing both annotation time and label budget. However, the benefits of such strategies for single-cell annotation have yet to be evaluated in realistic settings. Here, we perform a comprehensive benchmarking of active and self-supervised labeling strategies across a range of single-cell technologies and cell type annotation algorithms. We quantify the benefits of active learning and self-supervised strategies in the presence of cell type imbalance and variable similarity. We introduce adaptive reweighting, a heuristic procedure tailored to single-cell data—including a marker-aware version—that shows competitive performance with existing approaches. In addition, we demonstrate that having prior knowledge of cell type markers improves annotation accuracy. Finally, we summarize our findings into a set of recommendations for those implementing cell type annotation procedures or platforms. An R package implementing the heuristic approaches introduced in this work may be found at https://github.com/camlab-bioml/leader.


S. Figure 2. Active learning algorithms work as intended.
A-C) Accuracy of the active learning classifier, measured as the average F1-score across 10 seeds, as a function of the active learning iteration. As a control, all training labels were corrupted (purple). The columns show the results for each active learning setting, while the rows show how the initial set of cells was selected (randomly or ranked) and which active learning model was used (random forest, rf, or logistic regression, multinom). The same plot is shown for each cohort: A) CyTOF - Bone

S. Figure 4. Performance of active learning methods by active learning model.
Shown are the five accuracy measures across all ten train-test splits for the CyTOF - Bone marrow, scRNASeq - Breast cancer cell lines and snRNASeq - Pancreas cancer cohorts and selected dataset size, coloured by the active learning model used. A) Results when the initial set of cells was selected randomly. B) Results when the initial set of cells was selected by ranking their expression. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

random forest self-trainers for the scRNASeq liver and vasculature, and snRNASeq pancreas cancer datasets. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.
S. Figure 34. Self-training performance boost for the CyTOF Bone marrow cohort.
Shown is the improvement in F1-score when self-trained data is included in the dataset. The columns depict the selection method for the initial dataset, while the rows depict the number of cells annotated with ground truth values and the percentage of most confidently labeled cells included. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

when self-trained cell type labels are included and excluded for each cell type assignment method using a different number of cells in the initial self-training dataset. The panels are faceted by the number of self-trained cells included, e.g. 10% equates to including the 10% most confidently labeled cells in the training dataset. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.
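
The self-training captions describe adding a percentage of the "most confidently labeled" cells to the training dataset. The sketch below illustrates one way such a selection could work. It assumes that confidence is the maximum predicted class probability per cell, which the captions do not state; the function name `select_confident` and the example probabilities are hypothetical and are not taken from the authors' package.

```python
import numpy as np


def select_confident(probs, fraction):
    """Pick the `fraction` of cells with the highest predicted confidence.

    probs: (n_cells, n_classes) array of predicted class probabilities.
    fraction: share of cells to keep, e.g. 0.1 for the top 10%.
    Returns (indices of kept cells, their predicted class indices).
    Assumption: confidence = maximum class probability per cell.
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    # Keep at least one cell so that small inputs still return something.
    n_keep = max(1, int(round(fraction * len(confidence))))
    # Sort by descending confidence; a stable sort keeps ties in input order.
    order = np.argsort(-confidence, kind="stable")
    keep = order[:n_keep]
    return keep, probs[keep].argmax(axis=1)


# Hypothetical predictions for five unlabeled cells and two cell types.
example_probs = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.05, 0.95],
]
kept, pseudo_labels = select_confident(example_probs, fraction=0.4)
print(kept, pseudo_labels)
```

The selected cells and their predicted labels would then be appended to the ground-truth training set before the cell type classifier is retrained.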
[Figure panel titles: "Performance of selection methods comparing random forest and logistic regression active learning strategies obtained when randomly selecting the initial set of cells"; "Performance of selection methods comparing random forest and logistic regression active learning strategies obtained when ranking the initial set of cells". Legend: Active learning algorithm (LR, RF).]

S. Figure 5. Performance of active learning methods by active learning model.
Same as S. Figure 4 for the scRNASeq - Liver, scRNASeq - Lung cancer cell lines and scRNASeq - Vasculature datasets. A) Results when the initial set of cells was selected randomly. B) Results when the initial set of cells was selected by ranking their expression. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.
S. Figure 13. Effect of dataset imbalance on balanced accuracy.
Shown is the change in balanced accuracy, calculated as (accuracy in imbalanced dataset - accuracy in balanced dataset) / accuracy in balanced dataset. Each figure is faceted by the active learning model used (LR or RF), the cell selection method for the first 20 cells and the cell type prediction method. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 14. Effect of dataset imbalance on classification accuracy.
Same as S. Figure 13 for the tabula vasculature and liver atlas datasets. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.
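
The change reported in the imbalance figures is a relative difference between the score on the imbalanced dataset and the score on the balanced dataset. The sketch below works through that arithmetic. It assumes the common definition of balanced accuracy as the mean of per-class recall; the helper names and the example labels are hypothetical and are not the authors' code.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of balanced accuracy)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall for class c: fraction of true-c cells that were predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


def relative_change(imbalanced, balanced):
    """(score on imbalanced data - score on balanced data) / score on balanced data."""
    return (imbalanced - balanced) / balanced


# Hypothetical labels: class 0 has recall 3/4, class 1 has recall 1/2.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score_imbalanced = balanced_accuracy(y_true, y_pred)

# Hypothetical score obtained on a balanced version of the same dataset.
score_balanced = 0.80
print(score_imbalanced, relative_change(score_imbalanced, score_balanced))
```

A negative value means the classifier scored lower on the imbalanced dataset than on the balanced one.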
S. Figure 28. Effect of expanded dataset imbalance on classification accuracy for the CyTOF - Bone marrow, scRNASeq - Breast cancer cell line and snRNASeq - Pancreas cancer datasets.
Shown is the change for each metric, calculated as (accuracy in imbalanced dataset - accuracy in balanced dataset) / accuracy in balanced dataset. Each figure is faceted by cohort. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 29. Effect of expanded dataset imbalance on classification accuracy for the scRNALung and tabula datasets.
Shown is the change for each metric, calculated as (accuracy in imbalanced dataset - accuracy in balanced dataset) / accuracy in balanced dataset. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 30. Number of times a selection procedure is the top performing method.
The average improvement score (as calculated for S. Figures 8 and 9) is calculated for each method, selection procedure, dataset and metric. The number of times each selection procedure is the best performing (top) or among the best 3 performing (bottom) is shown for each metric. The selection procedures are ordered by the average number of times each method is among the top performing group. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 31. Effect of removing cells using a random forest classifier.
Shown are entropy values for all cells not in the initial training set of 20 cells. The values were calculated using the predicted probabilities from the active learning classifier after it was trained on the initial dataset. Boxplots are filled by the number of cells present of a particular type (shown in the plot title), while the x axis shows the ground truth cell type label. A) scRNASeq dataset with a random forest model. B) CyTOF dataset using a random forest model. As the maximum entropy depends on the total number of classes, the entropy values depicted were scaled by the maximum possible value for each experiment. Shown are the results across the 10 different train-test splits. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.
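
The entropy scaling mentioned in the caption for S. Figure 31 can be illustrated as follows. The sketch assumes Shannon entropy of the predicted class probabilities and uses the fact that its maximum for K classes is log(K), reached by a uniform prediction. The function name `scaled_entropy` and the example probabilities are hypothetical; the authors' exact implementation is not given in the captions.

```python
import numpy as np


def scaled_entropy(probs):
    """Shannon entropy per cell, divided by its maximum log(n_classes).

    probs: (n_cells, n_classes) array of predicted class probabilities.
    Returns values in [0, 1], so experiments with different numbers of
    cell types can be compared on one scale.
    """
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[1]
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    safe = np.where(probs > 0, probs, 1.0)
    entropy = -(probs * np.log(safe)).sum(axis=1)
    return entropy / np.log(n_classes)


# Hypothetical predictions for three cells and four cell types.
example = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [1.00, 0.00, 0.00, 0.00],  # fully certain
    [0.70, 0.10, 0.10, 0.10],  # in between
]
print(scaled_entropy(example))
```

In an active learning setting, cells with scaled entropy close to 1 are the ones the classifier is least certain about.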
S. Figure 35. Self-training performance boost for the scRNASeq breast cancer cell line cohort.
Shown is the improvement in F1-score when self-trained data is included in the dataset.

S. Figure 36. Self-training performance boost for the snRNASeq pancreas cancer cohort.
Shown is the improvement in F1-score when self-trained data is included in the dataset. The columns depict the selection method for the initial dataset, while the rows depict the number of cells annotated with ground truth values and the percentage of most confidently labelled cells included. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 37. Self-training performance boost for the scRNASeq lung cancer cell line cohort.
Shown is the improvement in F1-score when self-trained data is included in the dataset. The columns depict the selection method for the initial dataset, while the rows depict the number of cells annotated with ground truth values and the percentage of most confidently labeled cells included. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 38. Self-training performance boost for the scRNASeq liver cohort.
Shown is the improvement in F1-score when self-trained data is included in the dataset. The columns depict the selection method for the initial dataset, while the rows depict the number of cells annotated with ground truth values and the percentage of most confidently labeled cells included. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 39. Self-training performance boost for the scRNASeq vasculature cohort.
Shown is the improvement in F1-score when self-trained data is included in the dataset. The columns depict the selection method for the initial dataset, while the rows depict the number of cells annotated with ground truth values and the percentage of most confidently labeled cells included. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

S. Figure 40. The effect of self-training is limited to scenarios with a low number of training cells.
Shown is the absolute change in F1-score between the classifier performance

Figure 12. Runtime analysis for all selection methods benchmarked.
Shown is the runtime in seconds for each selection method and dataset for each of the ten train-test splits. All active learning methods were trained using a random set of 20 initial cells. Boxplots depict the median as the center line, the boxes define interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.
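
The boxplot convention repeated throughout these captions (median line, IQR box, whiskers up to 1.5 times the IQR, remaining points as outliers) can be reproduced numerically. The sketch below assumes quartiles from NumPy's default linear interpolation, which may differ slightly from the plotting library used for the figures; the function name `boxplot_summary` and the runtimes are hypothetical.

```python
import numpy as np


def boxplot_summary(values):
    """Summary statistics behind a boxplot with 1.5 x IQR whiskers."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": outliers.tolist(),
    }


# Hypothetical runtimes in seconds for ten train-test splits.
runtimes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_summary(runtimes))
```

Here the value 100 lies beyond the upper fence and would be drawn as an individual outlier point.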